Introduce async scheduler implementation with mixin pattern #941
GOavi101 wants to merge 1 commit into torch-spyre:main
Conversation
👋 Hi! Thank you for contributing. We also recommend installing prek and configuring it to check your code before every local commit.
Force-pushed from 1a3ecbb to b0e8e83
```python
SchedulerOutput = None

logger = init_logger(__name__)
from vllm_spyre.v1.core.scheduler_impl import (
```
@GOavi101 it looks like most of this file has been deleted and moved to scheduler_impl. Can you put the implementation back in this file so that reviewers can see what's changed?
Thanks, I've looked through the tests, but I'll wait to review the code changes until this diff is in nicer shape; I don't really want to try to recreate the diff myself 😉
Force-pushed from b0e8e83 to d71cfb3
```python
    return EMPTY_MODEL_RUNNER_OUTPUT
cached = self._last_execute_model_output
self._last_execute_model_output = None
return cached if cached is not None else EMPTY_MODEL_RUNNER_OUTPUT
```
Ideally we would actually run the sampling here - see related comment on the structured output PR: #903 (comment)
I'm fine with leaving this as-is and then fixing it to work with both async scheduling and structured outputs in a followup. Issue opened here: #947
Agreed, thanks for opening the issue. I've added a TODO(#947) comment pointing to it so it's tracked directly in the code.
```
Key behaviours under test:
- _is_async_scheduler() correctly identifies async vs sync instances
- PoolingSpyreMixin.schedule() applies warmup-shape constraints in both modes
- ChunkedPrefillSpyreMixin.schedule() bypasses Spyre constraints in async mode
```
This statement seems incorrect- we definitely can't just bypass spyre constraints because there are hard limits to what we can run on the cards. What's really going on?
Correct, nothing is bypassed; that docstring was wrong. The same constraints apply in both modes. The only async-specific code is a stale ongoing_prefills cleanup, needed because _update_after_schedule speculatively advances num_computed_tokens before update_from_output() confirms it. I've fixed the docstring.
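To make the cleanup concrete, here is a minimal sketch of the behaviour described above. All classes are stand-ins for illustration only (the real vLLM `Request`, `AsyncScheduler`, and mixin types have much more state); the names `ongoing_prefills`, `num_computed_tokens`, and `num_prompt_tokens` mirror the discussion.

```python
class Request:
    """Stand-in for a scheduled request; not the real vLLM Request type."""

    def __init__(self, req_id: str, num_prompt_tokens: int):
        self.req_id = req_id
        self.num_prompt_tokens = num_prompt_tokens
        self.num_computed_tokens = 0


class AsyncSchedulerStub:
    """Mimics AsyncScheduler._update_after_schedule, which speculatively
    advances num_computed_tokens before update_from_output() confirms it."""

    def _update_after_schedule(self, request: Request, scheduled_tokens: int):
        request.num_computed_tokens += scheduled_tokens  # run-ahead advance


class SpyreMixinStub(AsyncSchedulerStub):
    def __init__(self):
        self.ongoing_prefills: list[Request] = []

    def _cleanup_stale_prefills(self):
        # Async-only cleanup: drop requests whose prefill the speculative
        # advance already finished, so the next schedule() call sees
        # consistent state.
        self.ongoing_prefills = [
            r for r in self.ongoing_prefills
            if r.num_computed_tokens < r.num_prompt_tokens
        ]


sched = SpyreMixinStub()
req = Request("r0", num_prompt_tokens=8)
sched.ongoing_prefills.append(req)
sched._update_after_schedule(req, scheduled_tokens=8)  # prefill done speculatively
sched._cleanup_stale_prefills()
print(len(sched.ongoing_prefills))  # 0: the stale entry is removed
```

In the synchronous scheduler this cleanup is a no-op, because num_computed_tokens is only advanced after the model output is confirmed.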
```python
    is_pooling=True,
)
# Set as string path for vLLM's resolution (matches upstream behavior)
# Only convert to string if it's not already a string
```
a class should be fine to pass here though, what goes wrong?
Nothing goes wrong; you're right. SchedulerConfig.scheduler_cls is typed str | type | None and get_scheduler_cls() handles a class directly, so the string conversion was unnecessary. I've removed it in the latest push.
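For reference, a hypothetical sketch of the str | type | None handling being discussed. `resolve_scheduler_cls` and `DefaultScheduler` are made-up names for illustration, not vLLM's actual `get_scheduler_cls`; the point is only that a class object passes through untouched, so no string conversion is needed.

```python
import importlib


class DefaultScheduler:
    """Placeholder default scheduler; not the real vLLM class."""


def resolve_scheduler_cls(scheduler_cls):
    """Hypothetical mirror of resolving a str | type | None config value:
    strings are resolved as dotted import paths, classes pass through."""
    if scheduler_cls is None:
        return DefaultScheduler
    if isinstance(scheduler_cls, str):
        module_name, _, cls_name = scheduler_cls.rpartition(".")
        return getattr(importlib.import_module(module_name), cls_name)
    return scheduler_cls  # already a class: no string conversion needed


print(resolve_scheduler_cls(DefaultScheduler) is DefaultScheduler)  # True
print(resolve_scheduler_cls("collections.OrderedDict").__name__)  # OrderedDict
```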
```python
# The mixin's pre-filter pattern is not safe under that run-ahead scenario.
# For TP=1 (UniProcExecutor), futures are immediately done so it's safe.
if parallel_config.world_size > 1:
    scheduler_config.async_scheduling = False
```
Interesting- if we wanted to support this feature then it would likely need to work with TP=4 which is how we run most models. I thought this was only incompatible with pipeline parallel upstream - does it also not work with tensor parallel?
The fix is SpyreMultiprocExecutor — a thin MultiprocExecutor subclass that overrides max_concurrent_batches to return 1 instead of 2. This forces the engine to use the simpler step() path (strictly schedule → execute → update) rather than step_with_batch_queue, which was the only thing that broke TP>1.
Spyre's forward pass is synchronous, so there's no compute/schedule overlap to lose. The AsyncScheduler base class and its _update_after_schedule TTFT benefit are still fully active — we just removed the run-ahead that its state tracking couldn't handle.
So TP=1, TP=2, and TP=4 should all work with async scheduling now. Not a blocker.
what do you think?
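A minimal stub sketch of the override proposed in this comment. The class names here are stand-ins rather than the real vLLM executor classes, and `pick_step_path` is a made-up function approximating the engine's dispatch decision.

```python
class MultiprocExecutorStub:
    """Stand-in for MultiprocExecutor: max_concurrent_batches == 2 lets the
    engine use the step_with_batch_queue run-ahead path."""

    @property
    def max_concurrent_batches(self) -> int:
        return 2


class SpyreMultiprocExecutorStub(MultiprocExecutorStub):
    """The thin subclass described above: capping concurrency at 1 forces
    the plain schedule -> execute -> update step() path."""

    @property
    def max_concurrent_batches(self) -> int:
        return 1


def pick_step_path(executor) -> str:
    # Hypothetical mirror of how the engine chooses its step loop.
    return ("step_with_batch_queue"
            if executor.max_concurrent_batches > 1 else "step")


print(pick_step_path(SpyreMultiprocExecutorStub()))  # step
```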
That doesn't quite line up with my understanding: IIUC the step_with_batch_queue method is what works with the speculative scheduling. The engine runs the scheduler again while the model is running, assuming that the requests in the batch will continue.
> Spyre's forward pass is synchronous, so there's no compute/schedule overlap to lose
I don't quite understand this either- the multiproc executor is definitely async, it broadcasts an RPC to the workers to run the model and the engine gets back a future that it waits on. step_with_batch_queue queues up that future so that it can speculatively schedule the next pass.
This TP=1 profile shows the scheduler running in between the model forward passes, the goal with async scheduling is to get the scheduler running for the next step during the model forward pass instead:
> The AsyncScheduler base class and its _update_after_schedule TTFT benefit are still fully active — we just removed the run-ahead that its state tracking couldn't handle.
> So TP=1, TP=2, and TP=4 should all work with async scheduling now. Not a blocker.
Based on the above, my understanding is that the run-ahead state is the whole point and we won't gain any performance benefit from this unless we support it, so this is a blocker. Is there something else I'm missing?
You're right, thanks for the correction. I'll fix this — snapshot the mixin's mutable state (ongoing_prefills, tkv, previous_step_was_prefill) before delegating to super().schedule() so the run-ahead second schedule() call sees consistent state, and remove SpyreMultiprocExecutor. That way TP≥2 gets the full async scheduling benefit.
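A minimal sketch of the snapshot-and-restore idea, assuming the three mutable fields named above. The class and method names are stand-ins for illustration; the real mixin would take the snapshot before delegating to super().schedule() and restore it if the run-ahead schedule needs a consistent view.

```python
class ChunkedPrefillStateStub:
    """Stand-in carrying the mutable mixin state named in the comment:
    ongoing_prefills, tkv, previous_step_was_prefill."""

    def __init__(self):
        self.ongoing_prefills = ["req-0"]
        self.tkv = 64
        self.previous_step_was_prefill = True

    def snapshot(self):
        # Shallow-copy the list so a run-ahead second schedule() call can
        # mutate live state without corrupting the saved view.
        return (list(self.ongoing_prefills), self.tkv,
                self.previous_step_was_prefill)

    def restore(self, snap):
        prefills, tkv, was_prefill = snap
        self.ongoing_prefills = list(prefills)
        self.tkv = tkv
        self.previous_step_was_prefill = was_prefill


state = ChunkedPrefillStateStub()
snap = state.snapshot()
# Simulate the speculative second schedule() mutating live state...
state.ongoing_prefills.clear()
state.tkv = 128
state.restore(snap)
print(state.ongoing_prefills, state.tkv)  # ['req-0'] 64
```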
Thanks @GOavi101! A few notes:
Force-pushed from 1bd875b to 2246d48
After vLLM 0.14, the async scheduler is enabled by default. All the tests below are run using the async scheduler. To run with the synchronous scheduler:
Force-pushed from 7b0a718 to fb7ee62
Replace _create_pooling_scheduler() and _create_chunked_prefill_scheduler() factory functions with PoolingSpyreMixin and ChunkedPrefillSpyreMixin classes. Each mixin uses _is_async_scheduler() (an isinstance check) to detect the concrete base class at runtime and adjust behaviour accordingly, instead of capturing is_async via a closure variable.

Concrete classes use simple multiple inheritance:

```python
class PoolingSpyreScheduler(PoolingSpyreMixin, Scheduler): pass
class AsyncPoolingSpyreScheduler(PoolingSpyreMixin, AsyncScheduler): pass
class ChunkedPrefillSpyreScheduler(ChunkedPrefillSpyreMixin, Scheduler): pass
class AsyncChunkedPrefillSpyreScheduler(ChunkedPrefillSpyreMixin, AsyncScheduler): pass
```

Side effects:
- __module__/__name__/__qualname__ fixup blocks removed (no longer needed)
- _async_warning_logged flag removed (debug log emitted each call is fine)
- TYPE_CHECKING import removed (unused after refactor)

Signed-off-by: Avishek Goswami <avishek.goswami@ibm.com>
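The runtime detection in the commit message can be sketched self-contained. The base classes below are empty stand-ins for vLLM's Scheduler and AsyncScheduler; only the mixin/isinstance shape matters here.

```python
class Scheduler:
    """Stand-in for vLLM's synchronous Scheduler base class."""


class AsyncScheduler(Scheduler):
    """Stand-in for vLLM's AsyncScheduler base class."""


class PoolingSpyreMixin:
    def _is_async_scheduler(self) -> bool:
        # Detect the concrete base class at runtime via isinstance, instead
        # of capturing an is_async flag in a factory-function closure.
        return isinstance(self, AsyncScheduler)


class PoolingSpyreScheduler(PoolingSpyreMixin, Scheduler):
    pass


class AsyncPoolingSpyreScheduler(PoolingSpyreMixin, AsyncScheduler):
    pass


print(PoolingSpyreScheduler()._is_async_scheduler())       # False
print(AsyncPoolingSpyreScheduler()._is_async_scheduler())  # True
```

Because the mixin comes first in the MRO, it can override schedule() for both bases while _is_async_scheduler() tells it which concrete scheduler it is mixed into.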
Force-pushed from fb7ee62 to c5db31a
Description
Introduce async scheduler implementation with mixin pattern for cleaner architecture.
New Implementation (mixins)
- PoolingSpyreMixin and ChunkedPrefillSpyreMixin classes
- _is_async_scheduler() (isinstance check)
- class PoolingSpyreScheduler(PoolingSpyreMixin, Scheduler)
- class AsyncPoolingSpyreScheduler(PoolingSpyreMixin, AsyncScheduler)
- class ChunkedPrefillSpyreScheduler(ChunkedPrefillSpyreMixin, Scheduler)
- class AsyncChunkedPrefillSpyreScheduler(ChunkedPrefillSpyreMixin, AsyncScheduler)

Related Issues
Test Plan
tests/v1/core/test_async_scheduler.py (16 tests):
- TestIsAsyncScheduler: verifies _is_async_scheduler() detection (4 tests)
- TestPoolingSpyreMixinSchedule: tests warmup-shape constraints in sync/async modes (4 tests)
- TestChunkedPrefillSpyreMixinSchedule: verifies constraint bypass in async mode (3 tests)
- TestChunkedPrefillSpyreMixinUpdateFromOutput: tests scheduler output filtering in async mode (5 tests)

Checklist
- Code formatted (bash format.sh)
- Commit includes a Signed-off-by: line (DCO compliance)